Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

data file override support with precedence #916

Merged
merged 14 commits into from
Sep 24, 2024

Conversation

jmartin-tech
Copy link
Collaborator

Fix #891

  • Provides a custom path accessor for project data files to allow user's to override project provided datasets using $XDG_DATA_DIR/garak/data to mirror garak/data from the package.
  • Separates non-code data or dataset files from source code
  • Downloaded datasets are considered cached files and do not change location or behavior in the revision
  • Removes argument parser no longer supported by gcg

Consumers of the garak.data.path object can traverse the data directory as a standard Path object and will be returned existing files based on the order precedence from the supported locations on disk.

A possible drawback to the current implementation exists regarding directory interaction, if a folder exists at more than one path location the higher precedence path will be returned, when performing glob or iteration actions with such an instance the only files in the selected path on disk will be evaluated. This is currently considered an acceptable limitation.

Verification

List the steps needed to make sure this thing works

  • Run the tests and ensure they pass python -m pytest tests/
  • garak -m nim -p donotanswer.MisinformationHarms -g 1 -n meta/llama3-70b-instruct --parallel_attempts 5
  • create a limited data set for donotanswer.MisinformationHarms
mkdir -p ~/.local/share/garak/data/donotanswer
head -n 10 garak/data/donotanswer/misinformation_harms.txt > ~/.local/share/garak/data/donotanswer/misinformation_harms.txt
  • garak -m nim -p donotanswer.MisinformationHarms -g 1 -n meta/llama3-70b-instruct --parallel_attempts 5
  • Verify run results:
    • the first run reported 155 attempts
    • the second run report 10 attempts (count depends -n count passed to head)

Note: #870 provides new data files that will need to migrate to this pattern depending on PR landing order.

Signed-off-by: Jeffrey Martin <[email protected]>
* returns first found instance of filename
* custom `Path` raises exception when:
  * path escape attempt is detected
  * no file matching request exists
* move to tools path
* update to overwrite default `data` file

Signed-off-by: Jeffrey Martin <[email protected]>
@jmartin-tech jmartin-tech marked this pull request as draft September 17, 2024 22:27
@jmartin-tech
Copy link
Collaborator Author

Looks like there is some work to do with support for this on python 3.10 apparently!

@leondz
Copy link
Collaborator

leondz commented Sep 18, 2024

A possible drawback to the current implementation exists regarding directory interaction, if a folder exists at more than one path location the higher precedence path will be returned, when performing glob or iteration actions with such an instance the only files in the selected path on disk will be evaluated. This is currently considered an acceptable limitation.

Flagging a tension here with payload enumeration. Currently payload enum globs but uses its own directory traversal code, rather than garak.data. The result is that this PR doesn't break payload enum, but does leave us with two bits of code using similar-but-not-identical logic.

Copy link
Collaborator

@leondz leondz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few questions, looks good so far. Will be some extra work involved in aligning the payloads PR to the proposals here. The general separation is for non-code that can also be overridden should move into data; the rest remains in resources.

Comment on lines +33 to +34
_config.transient.data_dir / "data",
_config.transient.package_dir / "data",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

realising now that these aren't particularly transient pieces of information, so might go elsewhere in config. one advantage of this PR is that if this is worth updating one day, the update can be centralised in data and made in far fewer places.

garak/data/__init__.py Outdated Show resolved Hide resolved
garak/data/__init__.py Show resolved Hide resolved
garak/probes/topic.py Show resolved Hide resolved
tests/test_data.py Show resolved Hide resolved
tools/termscrape.py Show resolved Hide resolved
@leondz leondz added the architecture Architectural upgrades label Sep 18, 2024
garak/data/__init__.py Outdated Show resolved Hide resolved
from garak.detectors.base import StringDetector

surge_list = defaultdict(list)
with open(
_config.transient.package_dir / "resources" / "profanity_en.csv",
data_path / "profanity_en.csv",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably worth looking at as part of a larger refactor -- do we want all these datasets in our git? Or should we have them in some other place e.g. HuggingFace hub, and the garak.data module can also manage downloading these files?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have discussed in the past that we may want tooling that will download all datasets to create an offline deployment capability.

I suspect there could be some expansion on garak.data for handling access to known datasets. Treating HF as the specific location that tooling can register data as available from might be a good direction.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sometimes HF drops connections and things go wrong, so I prefer keeping smaller things closer. What that cutoff is, I don'k know - and it does also mean that garak will grow bigger over time.

Agree some expansion could work, perhaps using HF by default with a backup URI also (maybe a garak-data repo)

garak/resources/common.py Show resolved Hide resolved
/ "tap"
/ "data"
/ "tap_jailbreaks.txt"
garak._config.transient.cache_dir / "data" / "tap" / "tap_jailbreaks.txt"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since probes.tap references the prompts_location as data_path / "tap" / "tap_jailbreaks.txt", do we want this to reflect that as well? The idea is to keep your TAP results such that you can use them (or share them with us!) in subsequent probes. Though perhaps that's a larger issue.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree these should be captured better for re-use. I will defer that change to a future iteration though. I think there is a larger issue to address as we have a few techniques that generate these intermediate files that could be valuable from a research context.

@jmartin-tech
Copy link
Collaborator Author

Flagging a tension here with payload enumeration. Currently payload enum globs but uses its own directory traversal code, rather than garak.data. The result is that this PR doesn't break payload enum, but does leave us with two bits of code using similar-but-not-identical logic.

Depending on land order I am thinking this can be a use case to improve glob support on the custom Path objects to return a consolidated list of files based on searching all ORDERED_SEARCH_PATHS. This would then remove the similar-but-not-identical logic by supporting the need quickly as a next iteration.

@@ -92,7 +92,8 @@ def _gen_prompts(self, term):
def __init__(self, config_root=_config):
super().__init__(config_root)

self.data_dir = _config.transient.cache_dir / "resources" / "wn"
self.data_dir = _config.transient.cache_dir / "data" / "wn"
self.data_dir.parent.mkdir(mode=0o740, parents=True, exist_ok=True)
Copy link
Collaborator Author

@jmartin-tech jmartin-tech Sep 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting item here, the parent must exist before download however if the wn folder already exists the raise when attempting access below will be a different exception type and fails to recover.

This was masked previously by the fact that _config.transient.cache_dir / "resources" is created in early initialization for the plugin_cache.json file.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, interesting. Where do we want to check for / make XDG dirs? And then I guess should we be making subdirs using mkdir -p-equiv?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where do we want to check for / make XDG dirs?
https://github.com/leondz/garak/blob/0e1d96e3892e14564aaece56c1ee9f92cfdd7dc0/garak/_config.py#L68-L71

parents=True is the equivalent to -p.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, missed that. This looks good. Expect to move wn code to its own module/class once we know where those will live - did we settle on resources? Other options could be apis, external, apps, common, & not lib. Or is this a conversation to conclude once this PR lands?

@jmartin-tech jmartin-tech marked this pull request as ready for review September 18, 2024 16:33
Copy link
Collaborator

@leondz leondz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Signed-off-by: Jeffrey Martin <[email protected]>
@jmartin-tech jmartin-tech merged commit 29c5b9a into NVIDIA:main Sep 24, 2024
8 checks passed
@jmartin-tech jmartin-tech deleted the feature/resource-precedence branch September 24, 2024 16:47
@github-actions github-actions bot locked and limited conversation to collaborators Sep 24, 2024
@leondz
Copy link
Collaborator

leondz commented Sep 25, 2024

NICE

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
architecture Architectural upgrades
Projects
None yet
Development

Successfully merging this pull request may close these issues.

enable loading resources from XDG_DATA_HOME before hitting package dir
3 participants